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variable MCMC simulation with the expectation of interest expressed as an integrated 
set of rare event probabilities and derive our estimator from a Rao-Blackwellised esti- 
mate of a marginal auxiliary variable distribution. We illustrate our method with two 
applications. First, we compute a shortest path rare event probability and compare 
our method to estimation to a cross entropy approach. Then, we compute a normali- 
sation constant of a high dimensional mixture of Gaussians and compare our estimate 
to one based on nested sampling. Finally, we discuss the relationship between our 
method and alternatives such as bridge sampling and the Wang-Landau algorithm. 
The methods developed here are available in the R package: SplitSampling. 

Keywords: Rare Event, Normalisation, Cross Entropy, Slice Sampling, MCMC, Im- 
portance Sampling, Split Sampling, Serial Tempering, Annealing, Adaptive MCMC, Wang- 
Landau, Nested Sampling, Bridge Sampling. 

*E-mail: john.birge@chicagobooth.edu, changgee@uchicago.edu, ngp@chicagobooth.edu. We would like 
to thank the participants at the conference in honor of Pietrio Muliere at Bocconi, September 13-15, 2012 
for their helpful comments. The authors' work was supported by the University of Chicago Booth School 
of Business. 



1 Introduction 



In this paper we develop a methodology we refer to as split sampling methods to provide 
more precise estimates for expectations in high dimensions such as normalisation constants 
and rare event probabilities. We show that more precise estimators can be achieved by 
splitting the expectation of interest into a number of easier-to-estimate normalisation con- 
stants and then integrating those estimates to produce an estimate of the full expectation. 
To do this, we employ a family of splitting functions indexed by a random auxiliary vari- 
able and develop a weighting function with normalisation constants to generate the overall 
estimator. We allow for an adaptive MCMC approach to specify our weighting function. 
Other variance reduction techniques, such as control variates will provide further efficiency 
gains (see Dellaportas and Kontoyiannis, 2012, Mira et al, 2012). 

There are two related approaches in the literature. One approach is nested sampling 
for estimation of normalisation constants which sequentially estimates the quantiles of the 
likelihood function under the prior to provide an estimator. Other normalisation meth- 
ods include bridge and path sampling (Meng and Wong, 1996, Gelman and Meng, 1998), 
generalized versions of the Wang-Landau algorithm (Wang and Landau, 2001), and the 
TPA algorithm of Huber and Schott (2010). Serial tempering (Geyer, 2010) and linked 
importance sampling (Neal, 2005) provide ratios of normalisation constants for a discrete 
set of unnormalised densities. A second approach is cross entropy (Rubinstein and Glynn, 
2009, Asmussen et al, 2012) which sequentially constructs an optimal variance-reducing 
importance function for calculating rare event probabilities. The product estimator (Dia- 
conis and Holmes, 1994, Fishman, 1994) "splits" the rare event probability into a set of 
relatively large conditional probabilities whchi are easier to estimate. 

We now introduce the notation to characterize the estimation problems and to develop 
our method. The central problem is to calculate an expectation of a positive functional 
of interest, which we denote as £(x), under a /c-dimensional probability distribution 7r(x). 
We write this expectation as: 

Z = E n (L(x)) = / L(x)7r(x)dx . 
Jx 

The standard Monte Carlo estimate Z = (1/iV) Y2iLi L{x ( - t " > ) where is drawn from vr(x), 
possibly via MCMC, is too inaccurate. We introduce an auxiliary variable, m, a family of 
splitting functions, L m (x), and their normalisation constants, Z m = f x L m (x)7r(x)<ix. For 
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rare events L m (x) = I(L(x) > to); for normalisation constants L m (x) = exp (— mif(x)) 
and for counting problems L m (x) = #(x : L(x) > to). In general, the task at hand is 
NP-hard. We wish to develop a fully polynomial randomised approximation scheme. 

We will focus on the case where the splitting functions L m (x) are specified by level sets 
in the likelihood, namely L m (x) = I(£(x) > to). This occurs naturally in the rare event 
simulation literature and in nested sampling. The marginal distribution on to is defined 
via a weight function ou(m) with normalisation constants Z m . The joint importance split 
distribution, nssi*-, to), uses this marginal and the set of "tilted" posterior distributions. 
We assume that the researcher has a fast MCMC algorithm available for sampling the joint 
distribution. 

The rest of the paper is outlined as follows. Section 2 details our split sampling method- 
ology. We develop the key identity that "splits" the expectation of interest into an inte- 
grated set of rare event normalisation constants. MCMC then provides an estimator of the 
marginal distribution of the auxiliary variable, which in turn provides our overall estima- 
tor. We provide a number of guidelines for specifying our weight function. In the discrete 
case, we need to specify a "cooling schedule" for to, denoted by {m t }J =0 , and adaptively 
estimate the weights to provide the greatest variance reduction. Section 3 describes the 
relationship with nested sampling methods. We show how to choose the weight function 
to mimic the sampling behavior of nested sampling. Section 4 applies our methodology to 
a shortest path rare event probability and to the calculation of a normalisation constant 
for a spike-and-slab mixture of Gaussians. We illustrate the efficiency gains of split sam- 
pling over crude Monte Carlo, the conditional probability estimator (Diaconis and Holmes, 
1994, Fishman, 1994) and the cross entropy method. Finally, Section 5 concludes with an 
importance sampling view of nested and split sampling and directions for future research. 



Split sampling works as follows. Given a family of splitting functions L m (x) indexed by a 
random auxiliary variable to, we define the set of "tilted" distributions and corresponding 
normalisation constants by 



2 Split Sampling 



7T m (x) 




where 




in 
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A case of particular interest occurs when L m (x) = I (L(x) > to). In this case the tilted 
distribution 7r m (x) ~ 7r(x|L(x) > to) corresponds to conditioning on level sets of L(x). 
The Z m 's correspond to "rare" event probabilities 



L m (x)7r(x)(ix = / 7r(x)dx = P„. (L(x) > to) 

A" J L(x)>m 



We interpret 7r(x) as a "prior" distribution, L(x) as a likelihood, 7Tm(^) = L(x)ir(x.) / Z as a 
"posterior" distribution. The key to our approach will be the specification and estimation 
of the marginal distribution of to. 

The expectation of interest, Z, can be viewed as the integration of normalisation con- 
stants / °° Z m dm via the key identity 

Z — L(x)7r(x)dx = / < / 7r(x)dTO > rfx = / Z m dm . 

Jx Jo Um<L(x) J JO 

By this approach, we have "split" the computation of Z into a set of easier to compute 
normalisation constants Z m . 



2.1 Mixture Importance Splitting 

Split sampling will simultaneously estimate Z m and Z = J °° Z m dm. First, we need to 
specify an appropriate mixture importance sampling distribution. Given weights, ou(m), 
we specify an importance splitting density on to by 

u(m)Z m 
J (jj(m)Z m am 

The joint distribution of nssfe, to) has conditional posterior, 7155- (x|to) = 7r m (x), and is 

/ x / I \ u{m)Z m 

«ss(x,m)=nss(x\m)- iruj{m)Zmdm . 

When L m (x) = I(£(x) > to), the discrete joint distribution is given by 

I(L(x) > ^)tt(x) CU^m t 
7r 5S (x,TO t ) = 7r 5 5(TO t )7r 5 5(x|TO = TO t ) = — y — . 

m i 2^=o u tZ mt 

The conditional distribution for m — m t given x is given by 

7r(TO t |x) = — y , < t < T. 

2^=0 W s Ji {L(x)>m s } 
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The key feature of split sampling is that MCMC draws are available as is an efficient Rao- 
Blackwellised estimate of the marginal Ttssim) without the knowledge of the Z m 's. From 
the marginal, we will estimate Z m and Z = J °° Z m dm. 

Samples from (x, m)W ~ tt ss (x, m) are available with Gibbs sampling which iterates 
the conditionals 7r(x|m) ~ 7r m (x) and 7r(m|x). When L m (x) = I(L(x) > m), this reduces 
to running a Markov chain with density 7r(x) conditioned to the set {x : L(x) > m}, 
namely 

7r(x|m) ~ 7T (x|L(x) > m) . 

We can sample ir(xj\x.-j, m) in a component-wise fashion as well. Additional auxiliary 
variables can be added to improve the speed of convergence (Poison, 1996). 

Roberts (2010) observes that MCMC rather than Bayesian inference will achieve the 
largest efficiency gains in rare event probabilities (Glasserman et al, 1999, Glynn et al, 
2010) and in counting problems where the resultant chains can be hard to sample exactly. 

The complete conditional distribution for m|x does not depend on Z m and is given by 

L m (x)7r(x) w(m)L m (x) 
7r (m x oc u}(m)Z„ - 



J m 



u(m)L rn (x.)dm 



Given a discrete "cooling schedule" , denoted by {mo, • • . , mr}, we set u(m) = Ylt=i u tSm t (m), 
where 5 is a Dirac delta function. To achieve significant variance reduction gains, we will 
construct an adaptive MCMC method to determine an "optimal" schedule and weights. 
For rare events, we only require Z{^j) = P (L(x) > 7) and we set oj{m) = 0,m > 7. 

2.2 The Estimator 

We now derive estimators for Z m and Z. By construction, the joint mixture importance 
splitting distribution is 

/V ^\ u{m)Z m L m (x)7r(x) 

7r S5 (x,m) = . 

J u{s)Z s ds Z rn 



5 



Given samples xW ~ 71-55 (x), we can exploit a Rao-Blackwellised estimator for the marginal 
via the identity iiss( m ) which leads to the estimator 



1 N 

Kss(m) = E (tt S5 (x, to)) = — ^ 7r 55 (m|x w ) 



v 

A' 

i=i 



1 E 



N U /o°° w(m)L m (xW)dm ' 
Given our marginal density estimator, 71-55 (m), we can estimate Z m via 

= ^(0) TT S s(m) 

Z u(m) 7T S s(0) 
As Z — 1, it is natural to pick u;(0) = 1. Our new estimate of Z m is then 

-1 TTssM 



uj{m)~ 



7T5S(0) 



Convergence of Z m — > Z m is straightforward. By the ergodic theorem, 7Tss(m) — > 7T,ss(to) Vm. 
Therefore nss( m )/^ss(^) nss(™)/'Kss(Q)- The correction factor nssi^/^ssi^) needs 
to be estimated as accurately as possible. To do this we will use an adaptive choice of the 
weight function, ui(m) and use convergence results from the adaptive MCMC literature. 
When L m (x) = I (L(x) > m), the splitting density is 

7r(x,m) = 7r(m)7r(x|M = m) = v < m < 00 ■ 

J u(s)Z(s)ds Z m 

The posterior conditional of the auxiliary index m given x is 

7r(m|x) = w(m)I(L(x) > m)/fi(L(x)) where Q(m) = / u(s)ds . 

Jo 

The marginal density estimator of m is 



m)c(m) where ^(m) = - ^ n(L(x ( 0)) 
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This is a re- weighted version of the initial weights ou(m). 
The marginal density of x is given by 

n(L(x))7T(x) 
7TSS(XJ = . 

Jo w(s)Z(s)ds 

This provides a new estimator Z m , where xW ~ ttss^), and 

^ u(0)7c(m) = ExW^xO)^^ 1 ^^)) 
m u(m)9(0) ^f =1 0-i(L(x(«))) 

To find Z = J °° Z m dm, we use the summation-by-parts counterpart to Fubini, namely 

N 

n-\L(x.®))dm = (^(x (i) )) L(x w ) . 

xW : 

Therefore, we have 



I 



" -.« : L(x(0)> m i=l 



Z = V — . — ^ — — L(x^) . 



We now describe our split sampling algorithm. 

Algorithm: Split Sampling 

• Draw samples (x, m)W ~ 7r 55 (x, m) by iterating 7155 (x|m) and 7r 5 5 (m|x) 

• Estimate the marginal distribution 7r S s( m ) via 

w(m)L m (x w ) 



N tT Jo w(m)L m (x(0)dm 
• The new estimate of the individual normalisation constants, Z m , is 



Z m = u; ^m) 



7r S5 (mj 



7T5S(0) 

• Compute a new estimate, Z, via 

z = y — ^ — — — l(x^) 
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In the discrete case with a given grid, uj(m) = Ylt=o UJ tfim t (. m ) ■ The marginal probabil- 
ities 7T t = 7r(m t ) are estimated by Rao-Blackwellization as 

i N 1 ^ TT 

1 c i d)\ 1 w ' J1 {i(x('))> mt } n . . ^ ™ 

ff t^L ^Kl* 1 j ) = a7 2^ ; > < t < T. 



i=l 



1 XL=0 W sI{L(xW)>m s } 



With Z = 1, the estimator is Z t = uj^t/^t^o for < t < T. 

There is an equivalence with the product estimator where Z = Z Yl t=1 Z mt /Z mt -i over 
a discrete grid m t of m-values. Variance reduction is achieved by splitting Z into pieces 
Z mt /Z mt _i of larger magnitude which are relatively easier to estimate. 

With x.^ ~ 7r mt _ 1 (x), we estimate 



^mt /" ^mt(x) Z m 1 ^ L m (x^) 

7 = / 7 M^'^W^ Wlth 7 = AT 2^ r ; (Ox 



Given N independent samples from T tilted distributions, we have 



t=l i=i ^m t -H x t J 

The product estimator, as well as the cross-entropy estimator, relies on a set of independent 
samples drawn in a sequential fashion. Split sampling, on the other hand, uses a fast 
MCMC and ergodic averaging to provide an estimate Z. The Monte Carlo standard error 
v ar(nss( m )/^ss(0)) can be determined from the output of the chain. 

Controlling the Monte Carlo error is straightforward due to independent samples with 
relative mean squared error, e.g. Garvels et al (2002), given by 



with mean K(Z m J Z mt _J = fi mt and variance <r^ ( . Picking m t such that the coefficients of 
variation are all the same leads to an estimator with reduced variance. 

Huber and Schott (2010) discusses the theoretical advantages of constructing a "well- 
balanced" cooling schedule. Using Chebyshev-Hoeffding bounds, we obtain a running time 
of O ((ln(l/Z)) 2 ) steps. Stefankovic, Vempola and Vigoda (2009) provide theoretical argu- 
ments that can reduce the running times to 0(ln(l/Z)) steps. The empirical test of our 
mixture IS algorithm is whether there exists a fast MCMC algorithm for sampling 7r(x, m) 
that reduces the O(NT) effort in sequential methods. 
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2.3 Choice of uj(m) 

The previous subsection assumed that u(m) is fixed. To improve the efficiency of our 
algorithm, we can use an adaptive M CMC approach to specifying the weight function. 

A common initialisation is to set u(m) = 1, Mm. Then, Q (L(x)) = u s ds = L(x). 
This leads to an estimate of the marginal, /in (m) = nss(™>), given by the measure 



I i=i 



, I{m < L(x«)} , 



The density estimate will be zero for m > L (maxjxW) by construction. We also have a 
set of estimates Z' 1 = £ m<L(xW) L' 1 ^)/ ^ L~\^) where x« ~ tt 55 (x). 

Given an initial run of the algorithm, we can re-proportion the prior weight function to 
regions we have not visited frequently enough. To accomplish this, let (f>(m) be a desired 
target distribution for 7iss( m ), for example a uniform measure. Then re-balance the weights 
inversely proportional to the visitation probabilities and set the new weights u*(m) by 

w*(m) _ u){m) _ l{m < L(x»)} 
u(m) /iAr(m) Q(L(xW)) 

This will only adjust our weights in the region where m < maxj L(x^). As the algorithm 
proceeds we will sample regions of higher likelihood values and further adapative our weight 
function. 

Other choices for weights are available. For example, in many normalisation problems 
Z m will be exponential in m due to a Laplace approximation argument. This suggests 
taking an exponential weighting u>(m) = ne Km for some k > 0. In this case, we have 

t-rn AM 

n(m) = / u(s)ds = e K{mAM) - 1 . 
Jo 

The marginal distribution is 

( e «(L(x)AM) _ -Q^) 
7TSS(X) = 75o — • 

J Ke KS Z s as 

We can also specify uj(m) to deal with the possibility that the chain might not have visited 
all states by setting a threshold oo max which corresponds to the maximum allowable increase 
in the log-prior weights. This leads to a re-balancing rule 

w*(m) . Jmax m /i 7V (m) 

- mm < — r , e mt 



u>(m) [ /xjv(^) 
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where we have also re-normalised the value of the largest state to one. 

When L max is available, we set M = L max and oj(m) = for m > M. To initialise 
u)(m), we use the harmonic mean Z^ for u>(M) and an exponential interpolation for 
u(m). Drawing xW ~ 7r M (x) = L(x)7r(x)/Z, we have 



1 N 



Zm = E„ M (L^W) to estimate u(M) = - ^ ^(x (i) ) . 



i=i 



The harmonic mean estimator (Raftery et al, 2007) is known to have poor Monte Carlo 
error variance properties (Poison, 2006, Wolpert and Schmidler, 2012) although we are 
estimating u(m) and not its inverse. 

We can extend this insight to a fully adaptive update rule for ujy(m), similar to stochas- 
tic approximation schemes. Define a sequence of decreasing positive step sizes j n with 
J2^=i In 1 = 00 > J2^Li In 2 < oo. A practical recommendation is j n = CrT a where 
a G [0.6,0.7], see e.g. Sato and Ishii (2000). Another approach is to wait until a "flat 
histogram" (FH) condition holds: 



max 

m£{m t } 



H N (m) - <p(m) 



< c . 



for a pre-specified tolerance threshold, c. The measure //jv(m) = (1/iV) J2f=i # = m ) 
tracks our current estimate of the marginal auxiliary variable distribution. The Rao- 
Blackwellised estimate 7Tss( m ) further reduces variance. 

The empirical measure can be used to update ^(m) as the chain progresses. Let k n 
denote the points at which r y KN will be decreased according to its schedule. Then an update 
rule whchi guarantees convergence is to set 

log w % (m) <- logw KAr _ 1 (m) + 7 Kjv (fi KN (m) - <f>{m)) . 

Jacob and Ryder (2012) show that if 7at is only updated on a sequence of values kn 
which correspond to times that a "flat-histogram" criterion is satisfied, then convergence 
ensues and the FH criteria is achieved in finite time. After updating u KN (m), we re-set 
the counting measure jj, KN {m) and continue. Other adaptive MCMC convergence methods 
are available in Atchade and Liu (2010), Liang et al (2007), and Zhou and Wong (2008). 
Bornn et al (2012) provides a parallelisable algorithm for further efficiency gains. Peskun 
(1973) provides theoretical results on optimal MCMC chains to minimise the variance of 
MCMC functionals. 
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One desirable Monte Carlo property for an estimator is a bounded coefficient of vari- 
ation. For simple functions, L(x) = x and maxjXj, mixture importance functions achieve 
such a goal, see Iyengar (1991) and Adler et al (2008). Madras and Piccioni (1999, section 
4) hint at the efficiency properties of dynamically selected mixture importance blankets. 
Gramacy et al (2010) propose the use of importance tempering. Johansen et al (2006) use 
logit annealing implemented via a sequential particle filtering algorithm. 

2.3.1 Discrete Cooling Schedule 

We suggest a simple, sequential, empirical approach to selecting a "cooling schedule" in 
our approach. Specifically, set m = 0, then given m t _i we sample xW ~ 7r mt _ 1 (x) ~ 
7r(x|L(x) > m t -i). We order the realisations of the criteria function L(x^) and set m t 
equal to the (1 — p)-quantile of the L(x^) samples. This provides a sequential approach 
to solving 

p = P (L(x) > m t |L(x) > mt-i) = P (L(x) > m t ) /P (L(x) > m t _0 = Z^/Z^ . 

A number of authors have proposed "optimal" choices of p, which implicitly defines a 
cooling schedule, m t , for < i < T. L'Ecuyer et al (2006) and Amrein and Kunsch (2011) 
propose p = e~ 2 and 0.2, respectively. Huber and Schott (2010) define a well-balanced 
schedule as one taht satisfies e _1 < p < 2e _1 . They show that such a choice leads to 
fast algorithms. The difficulty is in finding the right order of magnitude of M and the 
associated schedule m t that ensures that each slice Z mt /Z mt _ 1 is not exponentially small. 
For rare events, we sample until m t+1 > M and then set m T = M. Our initial estimate 
Z = p~ T and our weights are uj(m) = p m . 

In hard cases, such as the multimodal mixture of Gaussians, the normalising con- 
stants Z(m) are not exponential in m. In such cases we initialize the weights by a 
piecewise exponential obtained by interpolating any point m G (m t _i,m t ) by Q(m) = 
fi t _i exp(K t (m — m t _i)) where K t = log(Q t /^t-i)/( m t ~ m t~i)- For m > m T , we use 
Q(m) = fir- 
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2.4 Discussion 



2.4.1 Importance Sampling Interpretation 

The estimator Z m can be viewed as an importance sampling estimator where we average 
(L(x^)) over the splitting set L(xW) > m with x^ ~ 7r,ss( x )- Similarly we can express 
Z as an importance sampling estimator as follows. Given w(s) and Q(m) = f™ u(s)ds, 
consider importance sampling with a blanket proportional to fi(L(x)). Then 

Z = J L(x)tt(x) = j fi -1 (L(x))L(x) jjf ( } o;(s)ds| 7r(x)dx . (1) 

Split sampling uses a proposal distribution proportional to f2(L(x))7r(x). This can be 
viewed as the marginal from a weighted slice distribution 

0j(s)7r(x) 

7r ss (x,m) = -7^5 — — on < s < L(x) . 

J u s Z s ds 

The estimator of Z, from ([!]), can be viewed as the ratio of normalisation constants 

z r i - 

roc 7 A = / Q-^irssW -v/Z^ 1 ( xW ) where xW ~ ^s(x) . 

To estimate the normalisation constant, J °° u s Z s ds, we use the harmonic mean 

i N 

1/— J^fi-^xW) where x (i) ~ 7r ss (x) . 
Therefore, our importance sampling estimate becomes 

Z = V — ^ — — L(x^) . 

feEj^n-W))) 

We also observe that the marginal ordinate 7^5(0) = 1/ u s Z s ds with oj = 1- With this 
interpretation, if the MCMC algorithm spends significant time in the zero state of m, then 
we will be able to provide an accurate Monte Carlo estimate of vr5s(0) and hence of Z. 

2.4.2 Adaptive Mixture Interpretation 

Our methodology can be viewed as an adaptive mixture importance sampler. Umbrella 
sampling (Torrie and Valleau, 1997) can be seen as a precursor to many of the current 
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advanced MC strategies such as the Wang-Landau algorithm and its generalisations for 
sampling high dimensional multimodal distributions. These algorithms exploit an auxil- 
iary variable and by their adaptive nature improve estimates continuously as the simulation 
advances. The main difference is how each algorithm traverses low and high energy states. 
The Wang-Landau algorithm aims to achieve a uniform distribution on the auxiliary vari- 
able, thus spending more time in low energy states than high states states as opposed to 
multicanonical sampling (Berg and Neuhaus, 1992), l//c-ensemble sampling (Hesselbo and 
Stinchcombe, 1995) or simulated tempering (Geyer and Thompson, 1995). 

The Wang Landau algorithm was originally designed for discrete probability distri- 
butions. Here the state space is split into M disjoint bins; so, X = uf =1 Xi. When 
7r(x) = exp (— mi^(x)) /Z m , the algorithm provides estimates of the level energy sets 
#{x : i/(x) = u}. Liang (2005) provides a generalisation to continuous distributions. 
As in equi-energy sampling (Kuo et al, 2006), we define Xi = {x : u^i < H(x.) < Ui} and 
the algorithm provides estimates of g$ = j x 7r(x)dx. Both algorithms can be viewed as a 
discrete mixture importance splitter with H m {x) = I^ m (x). 

The equilibrium distribution itwl(x) spends equal time in each bin, thus disproportion- 
ally sampling low energy states. The target density is specified by weights ip t — f x vr(x)rfx 
and is given by 7ty(x) oc 7r(x)/^j( x ) where J(x) is the index such that x e Xj^y As the 
objective is to estimate (ifti, . . . , ipx), the Wang-Landau approach uses an adaptive MCMC 
algorithm based on the sequence of densities rr 9t (x.) oc 7r(x)/#t(J(x)) where 9 t is adaptively 
updated using a stochastic approximation scheme or a "flat-histogram" criterion. Fort et 
al (2012) discuss convergence properties of the Wang-Landau algorithm. 

2.4.3 Bridge Sampling Interpretation 

Much of the intuition of split sampling methods occurs when m G {0, 1} and we have a 
two component mixture of prior and posterior. Given u > 0, define 

'«W = i+^' w + i + M z z 

We will estimate the relative probability uZ of being in each component. This is related 
to the Savage-Dickey approach to calculating Bayes factors. 

To see this, we write the conditional distribution of the mixture indicator is 

... 7r(ra = l|x) L(x)7r(x)/Z . . 

7T (m x) ~ Ber{p{x)) where p(x) = — — = uZ — = wi x . 

ir{m = 0|x) ^(xj 
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As in slice sampling, we view ^55 (x) as a marginal from the joint distribution 



7r 55 (x,m) 



1 



tt(x) + 



ujZ I (L(x) > to) 7r(x) 



1 + 



1 + Z 



3 Comparison with Nested Sampling 

We now describe nested sampling (NS, Skilling, 2006). Here we use the ir density to 
calculate the ordered x-quantiles of the likelihood. Then if we assume that z(x) has a well 
defined inverse x(z) we have 



JL(x')>L(x) 

Under this change of variables, the equivalent identity approximates the expectation of 
interest, Z, via simple quadrature 



Nested sampling sequentially determines L t by sampling 7r(x|L(x) > L t _i). Brewer et al 
(2011) propose a diffuse nested sampling approach to determine the levels L t . Both nested 
and diffuse nested sampling are product estimator approaches. The quantiles = Lq < 
. . . < L t < . . . are chosen so that each level L t occupies p = e" 1 times as much prior mass 
as the previous level L t -\. Diffuse nested sampling achieves this by sequentially sampling 
from a mixture importance sampler X^=i w j^ (-^( x ) > ^j-i) 71 ( x ) "where the weights are 
exponential Wj oc e K( - j, ~'- ) for some k. MCMC methods are used to traverse this mixture 
distribution with a random walk step for the index j that steps up or down a level with equal 
probability. A new level is added using the (1 — e _1 )-quantile of the likelihood draws. Using 
diffuse nested sampling allows some chance of the samples' escaping to lowered contrained 
levels and to explore the space more freely. One caveat is that a large contribution can 
come from values of x(z) near the origin and we have to find many levels T to obtain an 
accurate approximation. 

Murray et al (2006) provide single and multiple sample versions of nested sampling 
algorithms. If L max = sup x L(x) is known, we sample as follows: set X — 1, N — 1, Z — 0. 

1. Generate x 1 ^ ~ 7r(x) and set L — 0,L 1 — L(x^). 
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2. If L max X < eZ, then set Z = Z + (X/N) Y% =1 h+j-i and stop. 

3. Repeat while LiX/N > 5: 

(a) Generate x^ +iV ) ~ tt(x)I (L(x) > L;_i) and set L* = L( X ( i+iV )). 

(b) Set N = N + l and sort Li's. 

4. Set Z = Z + LiX/N and N = N - 1, X = (1 - 1/N)X. 
If L max is not known, replace step 2 with: 

2(a) If Li+N^X < eZ, then set Z = Z + (X/JV) J2?=i ^i+j-i- 

We now choose ui{m) to match the sampling properties of nested sampling. The main 
difference between split and nested sampling is that in split sampling we specify a weight 
function oj{m) for < m < oo and sample from the full mixture distribution, rather 
than employing a sequential approach for grid selection which requires a termination rule. 
Another difference is that split sampling estimator does not need to know the ordered L t 's. 



3.1 Matching Split and Nested Sampling 

We take the sampling distribution for nested sampling from Skilling (2006). The expected 
number of samples less than m is — A^logZ(m). If N — 1, we have Z(Lj)/Z(Lj_i) are 
independent standard uniforms since 

- log Z(L k ) = - for /c > 1 . 

The distribution of the number of samples less than m is the number of arrivals before 
— logZ(m) of a Poisson process with rate 1. This generalises to arbitrary N. 

The sampling distribution of the nested sampling for finite n is hard to calculate, but 
we can observe limiting results. As n — > oo, if N/n — > A, then we have 

Z NS (m) = P iVS (L(x) > m) = 1 + lim — log Z(m) = 1 + A log Z(m). 

n,N^>oo Tl 

Split sampling has marginal density of x given by 

fi(L(x))7r(x) . , . [ m . , , 

ttss(x) = r oo V ",\ J , with Q(m) = / u(s)ds . 
J u{s)Z{s)ds J 
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The tail distribution function Z ss {m) = iiss(L(x.) > m) is then 

Jo Jx Jo u{s)Z s ds Z s 

J °° uj(s)Z{m V s)ds Z m Q(m) + w(s)Z s rfs 
j^°u(s)Z s ds f™u(s)Z s ds 

We now find the importance splitting density that matches the nested sampling distribu- 
tional properties in the sense that Z^sirn) = Zss( m )- Since 

Z' NS {m) = ^ and Z' ss {m) 



Z m ' ' J n x X\s)Zjls 

where Z' NS {m) = dZ^sim) / dZ m . We therefore set Q(m) = Z~ x (m). 

As we wish to traverse the full likelihood surface, we also introduce a condition that 
lets us monitor the number of visits, Ni eve i, to the current top level of the likelihood before 
we construct a new level. Specifically, we run 

1. Set T = 0, m = 0, Q = 1, and Z = 1. 

2. While T < T max , set T = T + 1 

(a) Simulate 7Tss(x) with {m t } <t<r and {Qt}o<t<T until we have Ni evd visits to 
level T — 1. 

(b) Choose the (1 — p)-quantile of likelihoods of level T — 1 as m T . 

(c) Set Z T = p~ T . 

(d) Set n T = Z T \ 

Under the condition fl(m) = Z{m)~ l , the chain will visit each level roughly uniformly. 
However, it may take a long time to reach the top level, and the uncertainty in Q may act 
like a hurdle for visiting upper levels. With these concerns, it is desirable to favor upper 
levels by replacing step (d) with 

(dl) Set Q T = e AT Z T \ 

We call A the boosting factor as A increases the preference for the upper levels. This 
reduces the search time and ensures the time complexity to be 0(T). To further expedite 
this procedure, we may put more weight on the top level T by substituting (d) with the 
step: 
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(d2) Set a 



■T-l = 



e^ T -^Z T \ and fi T = /3 



e AT -l 7 -l 



For example, if /3 = 1, the chain spends half of the time on the top level and the other half 
backtracking the other levels. 

Once we identify all levels, our split sampling algorithm runs: 



2. While i < n, set i — i + 1. 

(a) Draws using MH under weights fl(m), and set L { = L(xW). 

(b) Obtain M« = H -1 ^) where C/< ~ C/(0,ft(L;)) with Q(0) = 1. 

(c) For each t with m t < Li, update i/ t = i/ t + fi(Lj) -1 . 

(d) Update Z t = v t /vo and set Vt t = Zf 1 . 

4 Applications 

4.1 Rare Event Shortest Path 

Calculating rare event probabilities is a common goal of many problems. Rubinstein and 
Kroese (2004) consider the total length of the shortest path on a weighted graph with 
random weights x = (xi, . . . , x 5 ). Each weight Xj follows an independent exponential 
distribution with scale parameter Uj with joint distribution given by 



The goal is to estimate the probability of the rare event corresponding to the length of the 
shortest path 

Z(j) = P(S'(x) > 7) where 5(x) = min(xi + x 4 , x 1 + x 3 + x 5 , x 2 + x 3 + x 4 , x 2 + x 5 ) 
We will consider three cases: 7 = 2,3, and 4 where the true rare event probabilities are 



These can be estimated by the split sampler (SS) with L(x) = S'(x) and level breakpoints 
{0 = mo, mi, ... , rriT = 7}. Z T = Z{mT) is the estimator. 



1. Set % = and v t = is init Z t for each t — 0, 1, . . . , T. 




where u = (0.25,0.4,0.1,0.3,0.2) . 



Z(2) = 1.34 x 10~ 5 , Z(3) = 2.06 x 10~ 8 , and Z(A) = 3.10 x 10 



,-n 
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We implement three other competing estimators. First, the crude Monte Carlo (CMC) 
estimator simulates xW ~ 7r(x|u) and estimates the rare event probabilities by 



i=i 

Second, the conditional probability product (CPP) estimator Z{pj) calculates the (1 — p)- 
quantile m t+1 of N samples of S(x.^'^) under x^*'^ ~ 7Tt(x) oc I{s( x ) >mt }7r(x) for all t = 
0, . . . , T — 1 with m = 0, m T _i < 7, and uit > 7. This estimator is defined as: 



conditionals 7r(xj|x(_j), S'(x) > m) given by truncated exponential distributions. By the 
lack of memory property, we have 

7r(xi|x(_i), S'(x) > m) = max(0, m — x 4 , m — x 3 — X5) + x* where x\ ~ Exp{u\). 

The other conditionals 7r(xj|x(_j), S'(x) > to) follow in a similar manner. 

Third, the cross- entropy (CE) estimator (de Boer et al, 2005) calculates an "optimal" 
importance blanket, 7t(x|vt), parameterised by Vr- Then it draws Aq samples of x^ ~ 
7r(x|v T ) and estimates the shortest path probability 



The sequential algorithm for finding is similar in spirit to the product estimator ap- 
proach: set v = u and t — 1. Choose p; typically p = 0.1. Then perform 

1. Draw N samples of xW ~ 7r(x|v t _i). Let % be the (1 — p) quantile of S'(xW). 
If 7* > 7, set 7 t = 7. 

2. Update v t _i via cross-entropy minimisation; 





To find yS*'^ we need to sample 7r(x|5'(x) > m t ). We use Gibbs sampling with complete 




tt(xW|u) 
7r(xW|v T )' 



v t = 



Ejll I {S(xO)>7,} W ( x( ' > i U > Vf-l) x ' 
Eii I {5(xW ) >T, t} w(xW; U, V t _i) 



18 



Table 1: Rare event probabilities simulation 



7 = 2 7 = 3 7 = 4 

N CMC CE CPP SS~~ CE CPP SS~~ CE CPP SS~~ 

10 5 0.807 0.040 0.044 0.055 0.066 0.076 0.091 0.113 0.098 0.133 

10 6 0.275 0.011 0.015 0.015 0.017 0.025 0.026 0.028 0.033 0.036 

10 7 0.086 0.003 0.004 0.005 0.005 0.007 0.007 0.008 0.009 0.011 




Figure 1: Rare event 7 = 4 

3. If 7t = 7, set T = t and exit. Otherwise, set t = t + 1 and go to step 1. 

Table [T] provides the simulation results. Each scenario was run 100 times and relative 
RMS of each estimator was recorded. The CMC estimator was only recorded for 7 = 2 
as the other events are too rare to even have a single count. The total sample size, N, 
was 10 5 , 10 6 , or 10 7 . The tuning parameters for CE were p = 0.1, iVo = 1,000 for 7 = 2 
and N Q = 10,000 for 7 = 3,4, and Ni = N — TN . For CPP, we used p = e" 1 and 
iVo = N/T where T is the number of required steps. For split sampling (SS), we set 
p = e -1 , Ni eve i = 10,000 and v in n = 10,000 and A = 0.1. Cross-entropy (CE) method 
finds an efficient importance sampling function as does split sampling. Further gains from 
split sampling are expected in higher dimensional problems where finding v t at each stage 
in CE can be cumbersome in general. 
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f=2 y=3 7=4 




Figure 2: Histogram of Levels 
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Figure 3: Centered Gaussians. Learning oj{m) 
4.2 Normalisation of a Mixture of Gaussians 

As an illustration of the advanatges of using split sampling we consider a centered and de- 
centered mixture of Gaussians. We follow the nested and diffuse nested sampling literature 
(Skilling, 2008, Brewer et al, 2011) and suppose that x = (#i, . . . , x c ) where C = 20. 
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Centered 



Decentered 




NS1 NS2 NS3 NS4 SS NS1 NS2 NS3 NS4 DNS SS 



Figure 4: NS vs DNS vs SS 

The centered likelihood is given by the classic Gaussian "spike-and-slab" of width 0.01 and 
"plateau" of width 0.1, namely 



20 1 ,2 20 x2 

L C U) = 100 TT -^^e~^ + TT -^^e~^ 
y ' J- J- .Av., J- J- «/a 



. /27TU "7 V27Tt> 

8=1 1=1 

The prior vr(x) is uniform. In the de-centered multimodal mixture we take 

20 , 20 2 

TT 1 K-0.031) 2 „ 1 

L DC (x) = 100 If -^e + TT 



27ru ylnv 

1=1 1=1 

The goal is to calculate the so-called evidence, Z — J L(x)7r(x)dx = 101. 

For the centered case, nested sampling vastly outperforms annealing and provides Monte 
Carlo bounds given by log Z « log(101) ± H/N with H = j x log (7r L (x)/7r(x)) 7r L (x)<ix = 
63.2. For the de-centered case, diffuse nested sampling is preferable. 

For the de-centered case, the rare event normalising constants Z(m) are severely L- 
shaped. When the likelihood is multimodal, the chain needs to be able to go down one 
likelihood to traverse another mode. The mixture split sampling importance blanket is 
designed to achieve this goal. One caveat, however, is that a poor set of initial probabilities 
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Table 2: SS vs NS, u = 0.01, v = 0.1, centered at origin, rms of log(Z) with true value 
log(Z) = 4.615. The number of MCMC steps is reported per each NS step. 



Algorithm 


Parameters 


RMS 


NS1 


300 particles, 333 MCMC steps 


0.557 


NS2 


1000 particles, 100 MCMC steps 


0.260 


NS3 


3000 particles, 33 MCMC steps 


0.174 


NS4 


10000 particles, 10 MCMC steps 


3.647 


SS 


p = e~\ T max = 100, v = 5000, A = 10 


0.207 



and weights, Q(m) = Zim)* 1 , leads to an algorithm that results in very low probability 
estimates for the levels on the spike. The "flat-histogram" criteria described below can 
take a long time to fix this problem. We ensure an adequate number of samples at the last 
level by placing more weight on the final level by multiplying ut by T. 

We use the split sampling algorithm as described in (2.3.1). If the initial Z t are not 
as accurate as needed to guarantee good mixing of our MCMC iterations, we dynamically 
refine Q t as in step (d). The beauty of this scheme is that this update makes the chain 
self-balanced. When Q t is larger than it should be, or Z t is smaller, the chain visits level t 
more often. Thus increasing Z t and decreases Qt, which helps Z t converge more quickly to 
the true value. 

From a practical perspective, it is critical to have nonzero initial values on v t . If we 
start with v t = 0, the early Z t and Q t are unstable, and the whole procedure can become 
abortive. Although not very accurate, Z t = p~ l is a reasonable initial estimate and we 
sample long enough so the chain stabilizes. v in a represents the degree of dependence on 
those initial values. 

The final estimator is Z = v t /v and 

7 " mt — (Zt-Zt-Mmt-mt-i) 



r°° fin* 

/ Z{m)dm = S ^^ j Z t -\ exp(— Kf(m — m t -i))dm = 



t=1 j m t _ x t=1 log Z t - log Z t -1 

At first sight, the time complexity appears to be 0{nT) since steps (c)-(e) involve 0(T) 
operations. However, if the m t values are chosen so that Z t are exponentially decreasing, 
the work can be done in O(nlogT) time. The expected number of comparisons in step (c) is 
the reciprocal of the decreasing rate of Z t . The updates needed at steps (d) are only for the 
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Figure 5: Decentered Gaussians. Learning cu(m) 

last several fs since the increment Q(Li)' 1 becomes negligible very quickly relative to v t as 
t decreases. We use a random walk MH proposal distribution given by x* = Xj + iV(0, a 2 ) 
where the density of the step size f(a) oc 1/a on [10~ 4,5 , 10 1,5 ] for random chosen index j. 

Table 3: SS vs (D)NS, u = 0.01, v = 0.1, the spike centered at (0.031, . . . ,0.031), rms 
of \og(Z) with log(Z) = 4.615. The number of MCMC steps is per each NS step. Diffuse 



nested sampling (DNS, Brewer et al, 2011). 


Algorithm 


Parameters 


RMS 


NS1 


300 particles, 333 MCMC steps 


2.467 


NS2 


1000 particles, 100 MCMC steps 


2.338 


NS3 


3000 particles, 33 MCMC steps 


2.519 


NS4 


10000 particles, 10 MCMC steps 


2.620 


DNS 


Diffuse Nested Sampling 


0.763 


SS 


p = e~\ T max = 100, v = 5000, A = 10 


0.591 



Table [2] compares the performance of the nested sampling and the split sampling meth- 
ods. For each run, log Z were recorded and their root mean squares are reported. The 
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number of runs for each case is 500. For nested sampling, the sample size n is random and 
we instead fix N such that the average sample size becomes slightly greater than the target 
sample size n. 

We also compare split sampling with nested and diffuse nested sampling in Table 3. 
Split sampling provides a RMS error of 0.591 versus 0.763 for diffuse nested sampling. 



5 Discussion 

von Neumann's original view of importance sampling was a method for variance reduc- 
tion. By viewing the calculation of an expectation as a problem of normalising a posterior 
distribution, we can write 

7Tl(x) = L(x)7r(x)/Z where Z — I L(x)7r(x)dx . 

Jx 

Importance sampling uses a blanket g(x) to compute 

z = //w^w rfx -^E L ( xW )^ where * (l) ^(x). 

Picking (?(x) to be the posterior distribution L(x) leads to the estimator Z = Z with zero 
variance. While impractical, this suggests finding a class of importance blankets g(x) that 
are adaptive and depend on L(x) can exhibit good Monte Carlo properties. 

Split sampling specifies a class of importance sampling blankets, indexed by ou(m), by 

^ uj s ds \ 7r(x) 



S Jq W s US ? 7T^X| rL,[x) 

^ w (x) = L r , w \, — where fi(L(x)) = / u s ds . 
\ x S2(x)7r(x)dx J 

The advantage of the class of split sampling densities is that the resultant estimator of Z 
can be implemented via an auxiliary MCMC algorithm from a joint distribution tt,ss ( x ? m ) 
indexed by a random auxiliary variable m. Moreover, it allows an adaptive choice of cjjv(m) 
to reduce the Monte Carlo error of the resultant estimator. Convergence results rely on 
adaptive MCMC literature. 

Split sampling illustrates the adaptive importance sampling nature of nested sampling 
and cross-entropy methods. There is also a clear relationship with slice sampling (Poison, 
1996, Neal, 2003) as one can view the sampling of the posterior, 7Tl(x), as the marginal 
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from the augmented distribution 7r(x, m) = I(L(x) > m)n(x)/Z. The main difference is 
that split sampling runs a Markov chain that traverses the whole space defined by (x, m) to 
find regions where u(m) needs to be refined. Both CE and NS methods using a sequential 
sampling procedure as in the CPP estimator to split the quantity of interest into estimable 
pieces. Further research is required to tailor the specification of the weight function u(m) 
to the problem at hand. 

We leave open the question of an optimal choice of L m (x). Here we have focused on 
L m (x) = I(L(x) > m), however, using logit-type functions might led to faster converging 
MCMC algorithms. The key to the efficiency of split sampling is being able to construct 
a rapidly mixing MCMC algorithm to sample the mixture distribution ^^(x, m). We aim 
to report on direct applications in Bayesian inference in future work. For example, Murray 
et al (2006) shows that nested sampling performs well for Markov random fields models 
and split sampling should have similar properties. 
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